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Abstract 

We develop a novel distributed algorithm for the minimum cut problem. We primarily 
aim at solving large sparse problems. Assuming vertices of the graph are partitioned into 
several regions, the algorithm performs path augmentations inside the regions and updates 
of the push-relabel style between the regions. The interaction between regions is considered 
expensive (regions are loaded into the memory one-by-one or located on separate machines 
in a network). The algorithm works in sweeps - passes over all regions. Let B be the set 
of vertices incident to inter-region edges of the graph. We present a sequential and parallel 
versions of the algorithm which terminate in at most 2\B\ 2 + 1 sweeps. The competing 
algorithm by Delong and Boykov uses push-relabel updates inside regions. In the case of 
a fixed partition we prove that this algorithm has a tight 0(n 2 ) bound on the number of 
sweeps, where n is the number of vertices. We tested sequential versions of the algorithms 
on instances of maxflow problems in computer vision. Experimentally, the number of sweeps 
required by the new algorithm is much lower than for the Delong and Boykov's variant. 
Large problems (up to 10 8 vertices and 6 • 10 8 edges) are solved using under 1GB of memory 
in about 10 sweeps. 
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"Needless to say, their version not only 
has its own real beauty, but is somewhat 
"sexy" running depth first search on the 
layered network constructed by (extended) 
breadth first search" - 

Y. Dinitz, about Even and Itai's version of 
the maximum flow algorithm. Citation. 

1 Introduction 

Minimum s-t cut (mincut) is a classical combinatorial problem with applications in many 
areas of science and engineering. This research was motivated by wide use of mincut/maxflow 
problems in computer vision, where large sparse instances need to be solved. In some cases an 
applied problem is formulated directly as a mincut. More often, however, mincut problems 
in computer vision originate from the energy minimization framework (maximum a posteriori 
solution in a Markov random field model). A large subclass of Energy minimization is formed by 
submodular problems, which reduce to mincut [HI [32] . Instances originating from submodular 
energies can be very large if the number of labels in the energy minimization is large. When the 
energy minimization is intractable, mincut is employed in relaxation and local search methods. 
The linear relaxation of pairwise energy minimization with two labels reduces to mincut [21 [23] as 
well as the relaxation of problems reformulated in two labels [2T] . Expansion-move, swap-move [9] 
and fusion-move [29] algorithms formulate a local improvement step as a mincut problem. 

Many applications of mincut in computer vision use graphs of a regular structure, with vertices 
arranged into an N-D grid and edges uniformly repeated (e.g. 3D segmentation models [SI IH E], 
3D reconstruction models [2HI El I2Z])- Because of this regular structure, the graph itself need 
not be stored in the memory, only the edge capacities, allowing relatively large instances to be 
solved by such a specialized implementation. However, in many cases, it is advantageous to have 
a non-regular structure (e.g. using an adaptive tetrahedral volume in 3D reconstruction [26, 19J). 
Such applications would benefit from a large-scale generic mincut solver. 

The previous research mostly focused on speeding up mincut by parallel computations. The 
following important distinction is to be made: the parallel computational model assumes that there 
are several computation units which share the same memory, whereas the distributed computa- 
tional model assumes that the computation units have their own separate memory and exchanging 
the information between them is expensive. A distributed algorithm has therefore to divide the 
computation and the problem data between the units and keep the communication rate low. We 
will consider distributed algorithms, operating in the following two practical usage modes: 

• Sequential (or streaming) mode, which uses a single computer with a limited memory and a 
disk storage, reading, processing and writing back a part of data at a time. Since it is easier 
for analysis and implementation, this mode will be the main focus of this work. 

• Parallel mode, in which the units are computers in a network. We show that sequential 
algorithms we consider admit full parallelization and prove the correctness and termination 
of the parallel versions. We also propose their experimental evaluation and comparison to 
two other state-of-the-art methods. 
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To represent the cost of information exchange between the units, we use a special related measure 
of complexity. We call a sweep the event when all units of a distributed algorithm recalculate 
their data once. The number of sweeps is roughly proportional to the amount of communication 
in the parallel mode or disk operations in the streaming mode. 

Previous Work. A variant of path augmentation algorithm was shown in [6] to have the 
best performance on computer vision problems among sequential solvers. There were several 
proposals how to parallelize it. Partially distributed implementation jHUj augments paths within 
disjoint regions first and then merges regions hierarchically. In the end, it still requires finding 
augmenting paths in the whole problem. Therefore it cannot be used to solve a large problem by 
distributing it over several computers or by using a limited memory and a disk storage. For the 
shared memory model there was reported [30J a near-linear speed-up with up to 4 CPUs for 2D 
and 3D segmentation problems. 

A distributed algorithm was obtained in [33] using dual decomposition approach. The sub- 
problems are mincut instances on the parts of the graph (regions) and the master problem is 
solved using the subgradient method. This approach requires solving mincut subproblems with 
real valued capacities and does not have a polynomial bound on the number of iterations. The 
integer algorithm proposed in [3l] is not guaranteed to terminate. Our experiments (Sect. 7.3) 
showed that it did not terminate on some of the instances in 1000 sweeps. In Sect. 10 we relate 
dual variables in this method to flows. 

The push-relabel algorithm [17] performs many local atomic operations, which makes it a good 
choice for a parallel or distributed implementation. A distributed version [TJ] runs in 0(n 2 ) time 
using 0(n) processors and 0(n 2 y/m) messages. However, it is crucial to implement gap relabel 
and global relabel heuristics for good practical performance [JU]. The global relabel heuristic can 
be parallelized [JJ, but it is difficult to distribute. We should note however, that in the experiments 
with computer vision problems we made, the global relabel heuristic was not essential. Delong 
and Boykov [JTJ proposed a coarser granulation of push-relabel, associating a subset of vertices 
(a region) to each processor. Push and relabel operations inside a region are decoupled from the 
rest of the graph. This allows to process several non-interacting regions in parallel or run in a 
limited memory, processing few regions at a time. The gap and relabel heuristics, restricted to 
the regions [TT] are powerful and distributed at the same time. 

Contribution. We revisit the algorithm of Delong and Boykov [IT] in the case of a fixed 
partition into regions. We study a sequential variant and a novel parallel variant of their algorithm, 
which allows computation on neighboring interacting regions to run concurrently using a conflict 
resolution similar to the asynchronous parallel push-relabel [J4] . We prove that both variants have 
a tight 0(n 2 ) bound on the number of sweeps. We then construct a new algorithm, which works 
with the same partition of the graph into regions but is guided by a different distance function 
than push-relabel. 

Given a fixed partition into regions, we introduce a distance function which counts the number 
of region boundaries crossed by a path to the sink. Intuitively, it corresponds to the amount 
of costly operations - network communications or loads-unloads of the regions in the streaming 
mode. The algorithm maintains a labeling, which is a lower bound on the distance function. 
Within a region, we first augment paths to the sink and then paths to the boundary nodes of 
the region in the order of their increasing labels. Thus the flow is pushed out of the region in 
the direction given by the distance estimate. We present a sequential and parallel versions of the 
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algorithm which terminate in at most 2\B\ 2 + 1 sweeps, where B is the set of all boundary nodes 
(incident to inter- region edges). 

Other Related Work. The following works do not consider a distributed implementation but 
are relevant to our design. Partial Augment-Relabel algorithm (PAR) [15] in each step augments 
a path of length k. It may be viewed as a lazy variant of push-relabel, where actual pushes 
are delayed until it is known that a sequence of k pushes can be executed. The algorithm [16] 
incorporates the notion of a length function and a valid labeling w.r.t. this length. It can be seen 
that the labeling maintained by our algorithm corresponds to the length function assigning 1 to 
boundary edges and to intra-region edges. In [16] this generalized labeling is used in the context 
of blocking flow algorithm but not within push-relabel. 



2 Mincut and Push-Relabel 



We will be solving mincut problem by finding a maximum preflow 1 , In this section, we give 
basic definitions and introduce the push-relabel framework |17j . 

By a network we call the tuple G = (V, E, s,t, c, e), where V is a set of vertices; E C V x V, 
thus (V, E) is a directed graph; s,t G V, s ^ t, are source and sink, respectively; c: E — > N 
is a capacity function; and e: V\{s,t} — > No is an excess function. Excess can be equivalently 
represented as additional edges from the source, but we prefer this explicit form. For convenience 
we let e(s) = oo and e(t) = 0. We also denote n = \V\ and m = \E\. 

For X, Y C Vwe will denote (X, Y) = E n (X x Y). For C C V such that s e C, t <£ C, the 
set of edges (C, C), with C = V\C is called an s-t cut. The mincut problem is 



dud <; c(u, v) + e(v) 

(u,v)£(C,C) v£C 



C cV, s eC, t eC 



The objective is called the cost of the cut. Without a loss of generality, we assume that E is 
symmetric - if not, the missing edges are added and assigned zero capacity. 
A preflow in G is a function / : E — > Z satisfying the following constraints: 

f(u,v) < c(u,v) V(u,v) G E (capacity constraint), (2a) 
f(u,v) = —f(u,v) V(u,v) G E (antisymmetry), (2b) 

e ( v ) + f( u ,v)>0 Wv G V (preflow constraint). (2c) 

u I (u,v)£E 

A residual network w.r.t. preflow / is a network Gf = (V,E,s,t,Cf,ef) with the capacity and 
excess functions given by 

c f = c-f, (3a) 
e f (v) = e(v)+ f^ v ^ VveV\{t}. (3b) 

u I (u,v)£E 



1 A maximum preflow can be completed to a maximum flow using flow decomposition, in 0(m log m) time. 
Because we are primarily interested in the minimum cut, we do not consider this step or whether it can be 
distributed. 
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By constraints (2) it is Cf > and e/ > 0. The costs of all s-t cuts differ in G and G/ by a constant 
called the flow value, \f\ = f(u,t). Network Gf is thus up to a constant equivalent to 

network G and |/| is a trivial lower bound on the cost of a cut. Dual to mincut is the problem 
of maximizing this lower bound, i.e. finding a maximum preflow: 

max |/| s.t. constraints (2). (4) 

We say that w G V is reachable from v G V in network G if there is a path (possibly of length 
0) from v to w composed of edges with strictly positive capacities. This relation is denoted by 
v — y w. If w is not reachable from v we write v -/» w. For any X, Y C V, we write X — > Y if 
there exist x G X , y G Y such that x — > y. Otherwise we write X -/> Y. 

A preflow / is maximum iff {t> | e(u) > 0} t in G/. In that case the cut (T,T) with 
T = {v£V\v — > t in Gf} has value in Gf. Because all cuts are non-negative it is a minimum 
cut. 

A Distance function d* : V — » No in G assigns to v £ V the length of the shortest path from v 
to t, or n if no such path exists. A shortest path cannot have loops, thus its length is not greater 
than n — 1. Let us denote d°° = n. 

A labeling d: V — > {0, . . . , d°°} is valid in G if d(t) = and d(u) < d(v) + 1 for all (u, v) G E 
such that c(tt, v) > 0. Any valid labeling is a lower bound on the distance d* in G. Not every 
lower bound is a valid labeling. A vertex v is called active w.r.t. (/, c?) if e/(t> ) > and d(v) < d°°. 

All algorithms in this paper will use the following common initialization. 

Procedure Init 

1 / := preflow saturating all ({s}, V) edges; G := Gf, f := 0; 

2 d:=0, d(s) := d°°; 

The generic push-relabel algorithm jTTj starts with Init and applies the following Push and 
Relabel operations while possible: 

• Push(u,t>) is applicable if u is active and Cf(u,v) > and d(u) = d{v) + 1. The operation 
increases f(u,v) by A and decreases f(v,u) by A, where A = min(e/(w), Cf(u, v)). 

• Relabel(w) is applicable if u is active and Vt> | (u, v) G E, Cf(u,v) > it is d(u) < d(v). It 
sets d{u) := min (d°°, min{d(v) + 1 | (u, v) G E, Cf(u, v) > 0}) . 

If u is active then either Push or Relabel operation is applicable to u. The algorithm preserves 
validity of labeling and stops when there are no active nodes. Then for any u such that ef(u) > 0, 
we have d(u) = d°° and therefore d*(u) = d°° and u t in Gf, so / is a maximum preflow. 

3 Region Discharge Revisited 

We now review the approach of Delong and Boykov [TT] and reformulate it for the case of a 
fixed graph partition. We then describe generic sequential and parallel algorithms which can be 
applied with both push-relabel and augmenting path approaches. 

Delong and Boykov [11] introduce the following operation. The discharge of a region R C 
V\{s,t} applies Push and Relabel to v G R until there are no active vertices left in R. This 
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localizes computations to R and its boundary, denned as 

B R = {w | 3u G R (u,w) G E,w ^ R, w / s,t}. (5) 

When a Push is applied to an edge (v,w) G (R,B R ), the flow is sent out of the region. We say 
that two regions Ri, R 2 C U\{s, t} interact if (R\, R 2 ) ^ 0. Discharges of non-interacting regions 
can be performed in parallel since the computations in them do not share the data. The algorithm 
proposed in [TT] repeats the following steps until there are no active vertices in V: 

1. Select several non- interacting regions, containing active vertices. 

2. Discharge the selected regions in parallel, applying region-gap and region-relabel heuristics. 

3. Apply global gap heuristic. 

All heuristics (global-gap, region-gap, region-relabel) serve to improve the distance estimate. 
They are very important in practice, but do not affect theoretical properties and will be discussed 
in Section 5, devoted to the implementation. 

While the regions in [TT] are selected dynamically in each iteration trying to divide the work 
evenly between CPUs and cover the most of the active nodes, we restrict ourselves to a fixed 
collection of regions (Rk)^ =1 forming a partition of V\{s, t} and let each region-discharge to work 
on its own separate subnetwork. We define a region network G R = (V R , E R ,s,t,c R ,e R ), where 
V R = R U B R U {s, t}; E R = (RU {s, t}, R U {s, t}) U (R, B R ) U (B R , R); c R (u, v) = c(u, v) if 
(u, v) G E R \(B R , R) and otherwise; e R = e\R U { Stt } (the restriction of function e to its subdomain 
R U {s, t}). This network is illustrated in Fig. 1(b). Note that the capacities of edges coming from 




Figure 1: (a) Partition of a network into 4 regions, (b) Region Network 

the boundary, (B R , R), are set to zero. Indeed, these edges belong to a neighboring region network. 
The region discharge operation of [TT] . which we refer to as Push-relabel Region Discharge (PRD), 



can 


now be defined as follows. 




Procedure (/, d) = PRD(G R ,d) 


/* assume d: V R -» {0, . . . , d°°] valid in G R 


*/ 


i while 3v G R active do 




2 


apply Push or Relabel to v 


/* changes / and d */ 


3 


apply region gap heuristic (Section 5) 


/* optional */ 
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Generic Region Discharge Algorithms. We now define generic sequential Alg. 1 and par- 
allel Alg. 2, which use a black-box Discharge function as a subroutine. The sequential algorithm 
takes regions one-by-one from the partition and applies the Discharge operation to them. For 
non-interacting regions, their Discharge operations are independent and can be executed in par- 
allel. The sequential algorithm can be implemented as several phases, where in each phase a 
subset of non-interacting regions from the partition is taken and processed in parallel. The num- 
ber of phases required in a general case will correspond to the minimal coloring of the region 
interaction graph. Instead, our parallel algorithm calls Discharge for all regions concurrently 
and then resolves conflicts in the flow similarly to the asynchronous parallel push- relabel [TTj . A 
conflict occurs if two interacting regions increase their labels on the vertices of a boundary edge 
(u,v) simultaneously and try pushing flow over it. In such a case, we accept the labels, but do 
not allow the flow to cross the boundary in one of the directions. 

In the case Discharge is PRD the sequential and parallel algorithms are implementing the push- 
relabel approach and will be referred to as S-PRD and P-PRD respectively. S-PRD is a sequential 
variant of [TT] and P-PRD is a novel variant, based on results of [17] and [TT] . 

Algorithm 1: Sequential Region Discharge 



Init 

while there are active vertices do 
for k — 1, . . . K do 

Construct G Rk from G 
(f,d') :=mschar g e(G R \d\ v n k ) 
G := Gf 
d\n k := d'\ Rk 

apply global gap heuristic (Section 5) 



/* a sweep */ 



/* apply /' to G */ 
/* update labels */ 
/* optional */ 



Algorithm 2: Parallel Region Discharge 



1 Init 

2 while there are active vertices do 



Discha.rge(G Rk , d\ V R k ) VA; 



d'\R k ■= 4k Vfc 

a{u,v) := ld'(u) < d'(v) + 1] \/{u,v) e (B, B) 
/* fuse flow 

a( v , u )fk( u i v ) + v )fj(% v) if (it, v) e (R k , Rj) 
f k {u,v) if (it, v) e (Rk, Rk) 

G '.= Gf 
d := d' 

global gap heuristic (Section 5) 



f(u, v) :-- 



/* a sweep */ 
/* in parallel */ 
/* fuse labels */ 
/* valid pairs */ 
*/ 



/* apply /' to G */ 
/* update labels */ 
/* optional */ 



We prove below that both S-PRD and P-PRD terminate with a valid labeling in at most 2n 2 
sweeps. Parallel variants of push-relabel p2] have the same bound on the number of sweeps. 
However, they perform much simpler sweeps, processing every node only once, compared to S/P- 
PRD. A natural question is whether the 0(n 2 ) bound is not too loose for S/P-PRD. In Appendix 
A we give an example of a graph, its partition into two regions and a sequence of valid Push and 
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Relabel operations, implementing S/P-PRD which takes 0(n 2 ) sweeps to terminate. The set of 
inter-region edges in this example is also constant, which shows that a better bound in terms of 
these characteristics is not possible. 

3.1 Complexity of Sequential Push- Relabel Region Discharge 

Our proof follows the main idea of the similar result for parallel push-relabel in [13J. The main 
difference is that we try to keep Discharge operation as abstract as possible. Indeed, it will be 
seen that proofs of termination of other variants follow the same pattern, using several important 
properties of the Discharge operation, abstracted from the respective algorithm. Unfortunately, 
to this end we do not have a unified proof, so we will analyze all cases separately. 

Statement 1 (Properties of PRD). 
Let (f',d') = PRD(GV), then 

1. there are no active nodes in R w.r.t. (f',d') (optimality) 

2. d' > d; d'\ B R = d\ B R (labeling monotony) 

3. d! is valid in G R (labeling validity) 

4. f'(u,v) > =>• d'(u) > d(v), V(u,v) G E R (flow direction) 

Proof. 1. Optimality. This is the stopping condition of PRD. 

2,3. Labeling validity and monotony: labels are never decreased and the Push operation 
preserves labeling validity |17] . Labels not in R are not modified. 

4. Flow direction: let f'(u, v) > 0, then there was a push operation from u to v in some step. 
Let d be the labeling on this step. We have d(u) = d(v) + 1. Because labels never decrease, 
d'{u) > d{u) > d{v) > d{y). □ 

These properties are sufficient to prove correctness and the complexity bound of S-PRD. They 
are abstract from the sequence of Push and Relabel operation done by PRD and for a given pair 
(/', dl) they are easy to verify. For correctness of S-PRD we need to verify that it maintains a 
labeling, which is globally valid. 

Statement 2. Let d be a valid labeling in G. Let /' be a preflow in G R and d' a labeling in G R 
satisfying properties 2 and 3 of Statement 1, Extend /' to E by letting f'\E\E R — an d extend 
d! to V by letting d'\v\R = d. Then d' is valid in Gf. 

Proof. We have that d' is valid in G R . For edges (u,v) G (V\R,V\R) labeling d' coincides with 
d and f'(u, v) = 0. It remains to verify validity on edges (v, u) G (B R , R) in the case cf(v, u) = 
and Cf(v,u) > 0. (These are the incoming boundary edges which are zeroed in the network G R ). 
Because = c R (v, u) = c R (v, u) — f(v, u) = —f(v, u), we have Cf(v, u) = c(v, u). Since d was valid 
in G, d{y) < d(u) + 1. The new labeling d' satisfies d'{u) > d{u) and d'{v) = d(v). It follows that 
d'(v) = d{y) < d{u) + 1 < d'(u) + 1. Hence d' is valid in G f >. □ 

We can now state the termination. 

Theorem 1. Sequential PRD terminates in at most 2n 2 sweeps. 

Proof. • Value of d does not exceed n for every node. 

• Because there are n nodes, d can be increased at most n 2 times. 

• Let $ = max{d(v) \ v G V, v is active in G }. This value may go up and down during the 
algorithm, but the total number of times it can change is bounded. 

1. Each sweep the increase of $ is no more than the total increase of d. 
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Let us consider a discharge of region Rk- Let (f',d!) be the the flow and labeling 
computed by the discharge. Let /' be extended to E by setting f\E\E R — and 
d! be extended to V by setting d'\ v \ R = d\ v \ R . Let G' — Gy and $' be the new 
function after applying the discharge. We need to show that 

$' _ $ < [d'(v) - d(v)\ (6) 

Let the maximum in the definition of $' be achieved at a node t>, so $' = cf(t>). 
Then either v ^ Rk U B Rk , in which case $' < $ (because the label and the 
excess of v in G and G" are the same), or v G Rk U -B Rfc and there exists a path 
(i> , i>i, ■ ■ ■ vi), vi = v, v Q , . . . vi-i e R k , such that /'(^i-i, ^) > 0, i = 1 . . . Z and w 
is active in G. We have $ > rf(f ), therefore 

i 

$' - $ < d'(ui) - d(uo) = - rf'(^-i)] + [d'(v ) - d(v )\ 

i=i 

(a) 



< 



^K(^)-^)] (7) 



=0 



(6) 



veR k uB R k veR k 

where inequality (a) is due to the flow direction property (statement 14) which 
implies d'(vi-i) > d(v{), the inequality (b) is due to monotony property (state- 
ment 1,2) and to v j C Rk U and the equality (c) is due to d'\ B R k = d\ B n k . 

Summing over all regions, which are disjoint, we obtain the claim. 

2. If d has not increased during a sweep (d' = d) then $ decreases at least by 1. Indeed, let 
us consider the set of vertices having the label greater or equal to the label of the highest 
active node, H — {v \ d(v) > $}. These vertices do not receive flow during all discharge 
operations due to the flow direction property. After discharging Rk, there are no active 
vertices in Rk D H (statement 1.1). Therefore, there are no active vertices in H after the 
full sweep. 

In the worst case, $ can increase by one n 2 times and decrease by one n 2 times. In at most 2n 2 
sweeps there is no active excess. □ 

Once the algorithm terminated with network G, equivalent to the initial one, we have that 
labeling d is valid in G and there are no active vertices. Hence the cut (T, T), with T — {v \ v — > 
t in G} is a minimum cut. 

3.2 Complexity of Parallel Push-Relabel Region Discharge 

Let us show that the following properties hold for a sweep of P-PRD. 

Statement 3. Let d be a valid labeling in the beginning of a sweep of P-PRD. Then the pair of 
fused flow and labeling (/', d') satisfies: 

1. d! > d] (labeling monotony) 



10 



2. d! is valid in Gf>; (labeling validity) 

3. f'(u,v) > =>- d'(u) > d(v), y(u,v) G E ; (flow direction) 

Proof. 1. We have d' Rk > d\ R k for all k. 

2. We have to prove validity for the boundary edges, where the flow and the labeling are fused 
from different regions. It is sufficient to study the two regions case. Denote the regions R\ 
and i?2- The situation is completely symmetric w.r.t. orientation of a boundary edge (u,v). 
Let u G R± and ti£i? 2 . Let only d'{v) < d'(u) + 1 be satisfied and not d'(u) < d'(v) + 1. By 
the construction in step 5 of Alg. 2 flow f 2 is canceled and f'(u,v) = f[(u,v) > 0. Suppose 
Cf/ (u, v) > 0, then we have that d'^u) < d[(v) + 1, because d[ is valid in G^, 1 . It follows that 
d'(u) = d[(u) < d[(v) + 1 = d{v) + 1 < d' 2 (v) + 1 = d'{y) + 1, where we also used labeling 
monotonicity property. The inequality d'{u) < d'{v) + 1 is a contradiction, therefore it must 
be that Cf(u,v) = 0. The labeling d' is valid on (u,v) in this case. Note that inequalities 
d'{v) < d'(u) + 1 and d'(u) < d'(v) + 1 cannot be violated simultaneously. In the remaining 
case, when both are satisfied, the labeling is valid for arbitrary flow on (u,v), so no flow is 
canceled. 

3. If f'(u,v) > then f' k {u,v) > and there was a push operation from u to v in the 
discharge of region Rk 3 u. Let dk be the labeling in G^ on this step. We have d'{u) > 
d k {u) = d k (v) + 1 > d(v) + 1 > d(v). 

□ 

Theorem 2. Parallel PRD terminates in at most 2n 2 sweeps. 

Proof. As before, total increase of d is at most n 2 . Let us verify that if d has increased during a 
sweep, then increase in $ is no more than total increase of d. Consider the pair (/', d') of fused 
flow and labeling, constructed by Parallel PRD in a sweep. As shown above, this pair satisfies 
properties 2-4 of statement 1 for region R = V\{s, t} and may be considered as a single Discharge 
operation on this region. Part 1 of the proof of Theorem 1 can be applied, considering region R 
and (/', d') on it. 

If d has not increased during a sweep (<f = d) then no relabel operation has occurred and 
a(u, v) = 1 for all (w, v) G B. Moreover, for each (u,v) G (Rk,Rj) either f' k (u,v) = or /j(-u,f) = 
so no flow is canceled by a or by the opposite flow on the fusing step. Let H — {v \ d(v) > $}. 
These vertices do not receive flow during all elementary push operations. After the sweep, all 
active vertices which were in H are discharged. Because there is no active vertices with label $ 
or above left, it is $' < By the same argument as in Theorem 1, the algorithm will terminate 
in at most 2n 2 sweeps. □ 

4 Augmented Path Region Discharge 

We will now use the same setup of the problem distribution, but replace the discharge operation 
and the labeling function. 

4.1 New Distance Function 

Let the boundary w.r.t. partition (Rk) k=1 be the set B = [j k B Rk . The region distance d* B {u) in 
G is the minimal number of inter-region edges contained in a path from u to t, or \B\ if no such 
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path exists: 



d* B {u) 



min IP n (23,23)1 if u -»> t, 

P=((u, Ul ),...,(Ur,t)) 



\B\ 



if u -f* t. 



This distance corresponds well to the number of region discharge operations required to transfer 
the excess to the sink (see Figure 2(a)). 



d* B (u) = 2 





(a) (b) 

Figure 2: (a) Illustration of region distance, (b) Illustration of Lemma 1: augmentation on pathes 
from x to u or from v to y preserves X Y, but not the augmentation on the red path. 

Statement 4. If u ->• t then d* B {u) < \B\. 

Proof. Let P be a path from u to t given as a sequence of edges. If P contains a loop then it 
can be removed from P and \P D (23, B)\ will not increase. A path without loops goes through 
each vertex at most once. For B C V there is at most \B\ — 1 edges in the path which have both 
endpoints in B. □ 

We now let d°° = \B\ and redefine a valid labeling w.r.t. to the new distance. A labeling 
d: V — > {0, . . . , d°°} is valid in G if d{t) = and for all (w, v ) G E such that c(w, i> ) > 0: 



d(u) < d(v) + 1 
d{u) < d{v) 



if (u,v)e(B,B), 
if (u,«)^(B,B). 



(9) 
(10) 



Statement 5. A valid labeling d is a lower bound on d* B . 
Proof. If u t then d{u) < d* B . Otherwise, let P = ((u, V\), . 



d* B , %.e. d* B [ 



(vi,t)) be a shortest path w.r.t. 



\P fl (B, B)\. Applying the validity property to each edge in this path, we have 



d(u) < d(t) + \Pn(B,B)\ = d* B (u) 
4.2 New Region Discharge 



□ 



In this subsection, reachability relations "—>■", "W, residual paths, and labeling validity will 
be understood in the region network G R or its residual Gj. 

The new Discharge operation, called Augmented Path Region Discharge (ARD), works as 
follows. It first pushes excess to the sink along augmenting paths inside the network G R . When 
it is no longer possible, it continues to augment paths to nodes in the region boundary, B R , in 
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the order of their increasing labels. This is represented by the sequence of nested sets T = {t}, 
7\ = {t} U {v G B R | d(v) = 0}, . . . , T d oo = {t} U {v G B R \ d(v) < d°°}. Set T k is the destination 
of augmentations in stage k. As we prove below, in stage k > residual paths may exist only to 
the set T fe \T fc _x = {v | d(v) = k - 1}. 

Procedure AKD(G R ,d) 

/* assume d: V R ->■ {0, . . . , cf°} valid in G R */ 

1 for A; = 0, 1, . . . , d°° do /* stage k */ 

2 T fc = {t} U {v E B R I < A;}; 

3 Augment (R, Tk); 

/* Region-relabel */ 

{min{/c | -u — >■ T fc } u G -R, m — > T d oo , 
u e R,u ^ T d oo , 
d(u) u G 5 R . 

Procedure Augment(X,F) 

l while £/iere exzsi a pai/i (fo,fi, • • • , i>j), C/(fj_i,fj) > 0, ej(v ) > 0, v G X, Vi G Y do 
augment A = mm(ef(v ), min c/(^j_i, i;,)) units along the path. 



The labels on the boundary, o^n, remain fixed during the algorithm and the labels d\n inside 
the region do not participate in augmentations and therefore are updated only in the end. 

We claim that ARD terminates with no active nodes inside the region, preserves validity and 
monotonicity of the labeling, and pushes flow from higher labels to lower labels w.r.t. the new 
labeling. These properties will be required to prove finite termination and correctness of S-ARD. 
Before we prove them (Statement 9) we need the following intermediate results: 

• Properties of the network G R maintained by the algorithm (Statement 6, Corollaries 1 
and 2). 

• Properties of valid labellings in the network G R (Statement 7). 

• Properties of the labeling constructed by region-relabel (line 4 of ARD) in the network G R 
(Statement 8). 

Lemma 1. Let X, Y C V R , X fl Y = 0, X -» Y . Then X -/» Y is preserved after i) augmenting a 
path (x, . . . ,v) with x G X and v G V R ; ii) augmenting a path (v, . . . , y) with y G Y and v G V R . 

Proof. (See Figure 2(b)). Let X be the set of vertices reachable from X. Let 3^ be the set of 
vertices from which Y is reachable. Clearly X D y — 0, otherwise X — > Y. We have that (X, X) 
is a cut separating X and Y and having all edge capacities zero. Any residual path starting in X 
or ending in Y cannot cross the cut its augmentation change the edges of the cut. Hence, X and 
Y will stay separated. □ 

Statement 6. Let v G V R and v T a in Gf in the beginning of stage k , where a, k G 
{0, 1, . . . d°°}. Then v T a holds until the end of the algorithm. 

Proof. We need to show that v T a is not affected by augmentations performed by the algorithm. 
If ko < a, we first prove v T a holds during stages ko < k < a. Consider augmentation of a 
path (u , Ui, . . . , ui), u G R, ui G T k C T a , ef(u ) > 0. Assume v T a before augmentation. 
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Figure 3: (a) Reachability relations in the network G R at the end of stage 1 of ARD: {v \ ef(v) > 
0} Ti; Tdoo\Ti -/» R. (b) Example of a path which may exist in the network G R , by Corollary 2 
it must be d(v) < d{w). 

By Lemma 1 with X = {v}, Y = T a (noting that X Y and the augmenting path ends in Y), 
after the augmentation v T a . By induction, it holds till the end of stage a and hence in the 
beginning of stage a + 1. 

We can assume now that k > a. Let A = {u G R | e/(w) > 0}. At the end of stage 
k — 1 we have A Tfc _i D T a by construction. Consider augmentation in stage k on a path 
(uq,u\ . . . ,ui), uo G R, ui G Xfc , ef(uo) > 0. By construction, w G A. Assume {v} U A T a 
before augmentation. Apply Lemma 1 with X = {v} U A, Y = T a (we have X Y and 
uq G A C X). After augmentation it is X T a . By induction, X T a till the end of stage k$. 
By induction on stages, v T a until the end of the algorithm. □ 

Corollary 1. If w G B R then w -/» Td( w ) throughout the algorithm. 

Proof. At initialization, it is fulfilled by construction of G R due to c R (B R , R) = 0. It holds then 
during the algorithm by Statement 6, □ 

In particular, we have B R t during the algorithm. 

Corollary 2. Let (u, v± . . . vi, w) be a residual path in G R from u G R to w G B R and let v r G -B R 
for some r. Then d(u r ) < 

Proof. We have f r ^> T Vr . Suppose d(w) < d(v r ), then w G and because v r — > w it is u r — >■ T Up 
which is a contradiction. □ 

The properties of the network G R established by Statement 6 and Corollary 2 are illustrated 
in Figure 3. 

Statement 7. Let d be a valid labeling, d(u) > 1, u G R. Then w ->*» Td( u )_i. 

Proof. Suppose n — >■ T . Then there exist a residual path («, V\ . . . Vi,t), Vt G R (by Corollary 1 it 
cannot happen that V{ G B R ). By validity of d we have < d(v\) < ■ ■ ■ < <i(f/) < = 0, 
which is a contradiction. 

Suppose d(u) > 1 and u — > Td( u )-i- Because u T , it must be that u — > w, w G B R and 
d(w) < d(u) — 1. Let (v q . . . v{) be a residual path with v o = u and uj = w. Let r be the minimal 
number such that v r G B R . By validity of d we have cf(tt) < d(vi) < • • • < d(u r _i) < rf(t> r ) + 1. 
By corollary 2 we have d(v r ) < rf(io), hence rf(-u) < <i(w) + 1 which is a contradiction. □ 

Statement 8. For d computed on line 4 and any u G R it holds: 
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1. d is valid; 

2. u^T a ^ d(u) > a + 1. 

Proof. 1. Let (w,t>) G and c(w,t>) > 0. Clearly u v. Consider four cases: 

• case u E R, v <E B R : Then w — >■ Td(„)+i, hence d(w) < d(v) + 1. 

• case u E R, v E R: H v T doo then d(t>) = d°° and d(u) < d(t>). If i> — > T d °°, 
then d(i>) = min{A; \v — >■ T fe }. Let fc = d(t>), then t> — >■ T fc and -u — >• T fc , therefore 
d(w) < k = d{y). 

• case -u G v E R: By Corollary 1, ii T d(u ). Because u — >■ it is v T d ( u ), 
therefore d(v) > d{u) + 1 and d(u) < d{y) — 1 < d(v) + 1. 

• case when u — t or v — t is trivial. 

2. The "<^=" direction follows by Statement 7 applied to d, which is a valid labeling. The 
direction: we have u T a and rf(-u) > min{/c | -u — > T k } = min{A; > a \ u — > T fc } > a + 1. □ 

Statement 9 (Properties of ARD). Let d be a valid labeling in G^. The output (f',d r ) of ARD 
satisfies: 

1. There are no active vertices in i? w.r.t. (/', d'); (optimality) 

2. d' > d, d'lsfl = d| B A; (labeling monotonicity) 

3. d' is valid in G^,; (labeling validity) 

4. /' is a sum of path flows, where each path is from a vertex « G to a vertex v G {£} U B R 
and it is d'(u) > d(v) if f G B R . (flow direction) 

Proof. 1. In the last stage, the algorithm augments all paths to T^oo. After this augmentation 
a vertex u E R either has excess or there is no residual path to T d oo and hence d'(u) = d°° 
by construction. 

2. For d(u) = 0, we trivially have d'(u) > d(u). Let d(u) = a + 1 > 0. By Statement 7, 
■u -n T a in G R and it holds also in G R by Statement 6, From Statement 8,2, we conclude 
that d'(u) > a + 1 = d(u). The equality d'\ B R = d\ B R is by construction. 

3. Proven by Statement 8.1. 

4. Consider a path from u to v G augmented in stage > 0. It follows that k = d(v) + 1. 
At the beginning of stage k it is u -» T k -i- By Statement 6, this is preserved till the end of 
the algorithm. By Statement 8.2, d'(u) > k = d(v) + 1 > d(v). □ 

Algorithm 1 and 2 for Discharge being ARD will be referred to as S-ARD and P-ARD, respec- 
tively. 

4.3 Complexity of Sequential Augmented Path Region Discharge 

Statement 2 holds for S-ARD as well, so S-ARD maintains a valid labeling. 
Theorem 3. S-ARD terminates in at most 2\B\ 2 + 1 sweeps. 

Proof. The value of d(v) does not exceed \B\ and d is non-decreasing. The total increase of d\ B 
during the algorithm is at most \B\ 2 . 

After the first sweep, active vertices are only in B. Indeed, discharging region R k makes all 
vertices v G R k inactive and only vertices in B may become active. So by the end of the sweep, 
all vertices V\B are inactive. 

Let us introduce the quantity 

$ = max{d(f ) \ veB, v is active in G }. (11) 
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We will prove the following two cases for each sweep but the first one: 

1. If d\j3 is increased then the increase in $ is no more than total increase in d\s- Consider 
discharge of R k . Let $ be the value before ARD on R k and $' the value after. Let <£>' = d'(v). 
It must be that v is active in G'. If v ^ V Rk , then d(t>) = d'(v) and e(t>) = e//(i>) so 
$ > = 

Let f G V^ Rfc . After the discharge, vertices in R k are inactive, sow G -B fifc and it is d'(v) = 
d(v). If v was active in G then $ > d(v) and we have $' — $ < d'(v) — d(v) = 0. If v 
was not active in G, there must exist an augmenting path from a vertex vq to i> such that 
i>o G R k nB was active in G. For this path, the flow direction property implies d'(vo) > d(i>). 
We now have < d» - d(u ) = d(v) - d(v ) < d'{v )-d{v ) < E^nsK^) ~ d(v)]. 

Summing over all regions, we get the result. 

2. If d\j3 is not increased then <3> is decreased at least by 1. We have d' = d. Let us consider the 
set of vertices having the highest active label or above, H — {y \ d(v) > $}. These vertices 
do not receive flow during all discharge operations due to the flow direction property. After 
the discharge of Rk there are no active vertices left in R k D H (property 9,1). After the full 
sweep, there are no active vertices in H. 

In the worst case, starting from sweep 2, $ can increase by one \B\ 2 times and decrease by one 
\B 2 \ times. In at most 2\B\ 2 + 1 sweeps, there are no active vertices left. □ 

On termination we have that the labeling is valid and there are no active vertices in G. 
4.4 Complexity of Parallel Augmented Path Region Discharge 

Statement 10 (Properties of Parallel ARD). Let d be a valid labeling in the beginning of a 
sweep of P-ARD. Then the pair of fused flow and labeling (/', d') satisfies: 

1. Vertices in V\B are not active in Gf>. (optimality) 

2. d' > d] (labeling monotony) 

3. d' is valid; (labeling validity) 

4. /' is the sum of path flows, where each path is from a vertex u G V to a vertex v G B, 
satisfying d'(u) > d(v). (weak flow direction) 

Proof. 1. For each k there are no active vertices in R k w.r.t. (f' k ,d' k ). The fused preflow /' 
may differ from f' k only on the boundary edges (u, v) G (£>, B). So there are no active vertices 
in V\B w.r.t. (f',d'). 

2. By construction. 

3. Same as in P-PRD. 

4. Consider the augmentation of a path from u G R k to v G B Rk during ARD on G Rk and 
canceling of the flow on the last edge during the flow fusion step. Let the last edge of the 
path be (w, v). We need to prove that d'(u) > d(w). Let d be the labeling in G k right before 
augmentation, as if it was computed by region-relabel. Because d is valid it must be that 
d(w) < d(v) + 1. We have d' k (u) > d(v) > d(v) > d(w) - 1 > d(w) - 1. So d'(u) > d(w). 

Theorem 4. Parallel algorithm 2 with ARD terminates in 2\B\ 2 + 1 sweeps. 

Proof. As before, total increase of d\s is at most \B\ 2 . 

After the first sweep, active vertices are only in B by Statement 10.1. 

For each sweep after the first one: 

• If d\j3 is increased then increase in $ is no more than the total increase of d\&- 
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Let <£>' be the value in the network G = Gf>. Let $' = d'(v). It must be that v is 
active in G' and v a B. 

If v was active in G then $ > d(v) and we have $' — $ < d'(v) — d(i>). 
If f was not active in G then there must exist a path flow in /' from a vertex vq to i> 
such that v q G B was active in G. For this path, the weak flow direction property 
implies d'(v ) > d(v). We have $' - $ < d'(u)-d(u ) = d'(v)-d(v)+d(v)-d(v ) < 
d'(v) - d( w ) + d'(v ) - d(v ) < Ev&[*(v) ~ d(v)]. 
• If die is not increased then $ is decreased at least by I. 

In this case, /' satisfies the strong flow direction property and the proof of Theo- 
rem 3 applies. 

After total of 2\B\ 2 + 1 sweeps, there are no active vertices left. 



5 Implementation 

In this section we first discuss heuristics commonly used in the push-relabel framework. They 
are essential for the practical performance of the algorithms. We then describe our base imple- 
mentations of S-ARD/S-PRD and the solvers they rely on. In the next section we describe an 
efficient implementation of ARD, which is more sophisticated but has a much better practical 
performance. 

5.1 Heuristics 

Region-relabel heuristic computes labels d\u of the region vertices, given the distance estimate 
on the boundary, d\ B n. There is a slight difference between PRD and ARD variants (using distance 
d* and d* B , resp.), displayed by the corresponding if conditions. 



Algorithm 3: Region-relabel^", d 



B 



R 



1 d(t) := 0; O := {t}; d\ R := cf°; cf urrcnt := 0; /* init */ 

2 if ARD then d\ B n := d\ B n + 1; /* (for ARD) */ 
/* O is a list of open vertices, having the current label rf current */ 

s d max := max{dH \ w e B R , d(w) < d°°}; 
4 while O ^ or d CUTTCnt < d max do 

/* if O is empty raise rf currcnt to the next seed */ 
if O = then cf urrcnt := mm{d(w) \ w G B R , d{w) > d CUTTen \ d(w) < d°°}; 

/* add seeds to the open set */ 
O := O U {w G B R | d(w) = cf urrcnt }; 

/* find unlabeled vertices from which O can be reached */ 

7 O := {u G R | (u, v) G E R , v G O, c(u, v) > 0, d{u) = d°°}; 

s if PRD then d CUTTent <- d CUTrcnt + 1; /* (for PRD) */ 

9 [ d\ := d cmrent ; /* label them */ 

10 if ARD then d\ B n := d\ B n - 1; /* (for ARD) */ 

In the implementation, the set of boundary vertices is sorted in advance, so the algorithm runs 
in 0(\E R \ + \V R \ + \B R \ log \B R \) time and uses 0(\V R \) space. The resulting labeling d! is valid 



17 



and satisfies d! > d for arbitrary valid d. 

Global gap heuristic. Let us briefly explain the global gap heuristic [TU]. It is a sufficient 
condition to identify that the sink is unreachable from a set of nodes. Let there be no nodes with 
label g > 0: Wv G V d(v) ^ g, and let d(u) > g. For a valid labeling d, it follows that there is no 
node v for which c(u, v) > and d(v) < g. Assuming there is, we will have d{u) < d(v) + 1 < g, 
which is a contradiction. Therefore the sink is unreachable from all nodes {u \ d(u) > g} and their 
labels may be set to d°°. 

Region gap heuristic [TTJ detects if there is no nodes inside region R having label g > 
0. Such nodes can be connected to the sink in the whole network only through one of the 
boundary nodes, so they may be relabeled up to the closest boundary label. Here is the algorithm 2 : 



Algorithm 4: Region-gap(G R , d, g) 


/* Input: region network G R , labeling d, gap g: \/v G R d(v) ^g 


*/ 


i d next := mm{d(w) w G B R , d(w) > g}; 




2 for v G R such that g < d(v) < d next do 




3 d{y) := d ncxt +l 





If no boundary vertex is above the gap, then d next = d°° in step 1 and all nodes above the gap 
are disconnected from the sink in the network G. Interestingly, this sufficient condition does not 
imply a global gap. In PRD, we detect the gap efficiently after each node relabel operation, by 
discovering an empty bucket (see below). 



5.2 Referenced Implementations 

Boykov-Kolmogorov (BK). The reference augmenting path implementation [6]. There is 
only a trivial 0(mn 2 \C\) complexity bound known for this algorithm 3 , where C is the minimum 
cut value. 

Highest level Push- Relabel (HIPR). is reference push-relabel implementation [10] (v3.6, 
: / / www . avglab . com/andrew/ sof t . html ) . This implementation has two stages: finding the 
maximum preflow / minimum cut and upgrading the maximum preflow to a maximum flow. Only 
the first stage was executed and benchmarked. We tested two variants with frequency of the global 
relabel heuristic equal to 0.5 (the default value in HIPR v3.6) and equal to 0. These variants will 
be denoted HIPR0.5 and HIPR0 respectively. HIPR0 executes only one global update at the 
beginning. Global updates are essential for difficult problems. However, HIPR0 was always faster 
than HIPR0.5 in our experiments with real test instances. This is a good indication that global 
relabel and region-relabel heuristics are not essential for PRD as well. The worst case complexity 
is 0(n 2 \fm). 

5.3 S/P-ARD implementation 

The basic implementation of ARD simply invokes BK solver as follows. On stage we compute 
the maximum flow on the G R by BK, augmenting pathes from source to the sink. On the stage 

2 Region-gap-relabel in [HI fig. 10] seems to contain an error: only nodes above the gap should be processed in 
step 3. 

3 The worst-case complexity of breads-first search shortest path augmentation algorithm is just 0(m\C\). Tree 
adaptation step, introduced in [5] to speed-up the search, does not have a good bound and introduces additional 
0(n 2 ) factor. 
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k, infinite capacities are added from the boundary nodes having label k — 1 to the sink, using 
the possibility of dynamic changes in BK. The flow augmentation to the sink is then continued, 
reusing the search trees. The relabel procedure is implemented as Alg. 3, In the beginning of next 
discharge we clear the infinite link from boundary to the sink and continue as above. Some part 
of the sink search tree, linked through the boundary nodes, get destroyed, but the larger part of 
it and the source search tree are reused. A more efficient implementation is described in Sect 6, 
It includes additional heuristics and maintenance of separate boundary search trees. 

S-ARD. In the streaming mode we keep only one region in the memory at a time. After 
a region is processed by ARD all the internal data structures have to be saved to the disk and 
cleared from memory until the region is discharged next time. We manage this by allocating all 
the region's data into a fixed page in the memory, which can be saved and loaded, preserving 
the pointers. By doing the load/unload manually (rather than relying on the system swapping 
mechanism), we can accurately measure the pure time needed for computation (CPU) and the 
amount of disk I/O. We also can use 32bit pointers with larger problems. 

When the sequential algorithm 1 terminates, it has found an optimal preflow. However, we still 
need to do some extra work to find an optimal cut. On termination the labeling d is only a lower 
bound on the distance to the sink. Therefore if d(v) = d°° we are sure that v -» t in G and hence 
v must be in the source set, but if d(v) < d°° it is still possible that v t in G. Therefore we 
execute several extra sweeps, performing only region-relabel and gap heuristic, until labels stop 
changing. In practice it takes from to 2 extra sweeps. 

A region with no active vertices is skipped. The global gap heuristic is executed after each 
region discharge. Because it is based on labels of boundary nodes only, it is sufficient to maintain 
a label histogram with \B\ bins to implement it. S-ARD uses 0(\B\ + \(B, B)\) "shared" memory 
and 0(\V R + E R \) "region" memory, to which regions are loaded one at a time. 

To solve large problems, which do not fit in the memory, we have to create region graphs without 
ever loading the full problem. We implemented a tool called splitter, which reads the problem 
from a file and writes edges corresponding to the same region to the region's separate "part" file. 
Only the boundary edges (linking different regions) are withheld to the memory. 

P-ARD. We implemented this algorithm for a shared-memory system using OpenMP language 
extension. All regions are kept in memory, the discharges are executed concurrently in separate 
threads, while the gap heuristic and messages exchange are executed synchronously by the master 
thread. 

5.4 S/P-PRD implementation 

To solve region discharge subproblems in PRD in the highest label first fashion we designed a 
special reimplementation of HIPR, which will be denoted HPR. We intended to use the original 
HIPR implementation to make sure that PRD relies on the state-of-the art core solver. It was not 
possible directly. A subproblem in PRD is given by a region network with fixed distance labels 
on the boundary (let us call them seeds). Distance labels in PRD may go up to n in the worst 
case. The same applies to region subproblems as well. Therefore, keeping an array of buckets 
corresponding to possible labels (like in HIPR), would not be efficient. It would require 0(|V|) 
memory and an increased complexity. However, because a region has only \V R \ nodes, there are 
no more than \V R \ distinct labels at any time. This allows to keep buckets as a doubly-linked 
list with at most \V R \ entries. Highest label selection rule and the region-gap heuristic can then 
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be implemented efficiently with just a small overhead. We tried to keep other details similar to 
HIPR (current arc data structure, etc.). HPR with arbitrary seeds has the worst case complexity 
0(|l/ R | 2 y^E^j) and uses OQU^I + \V E \) space. When the whole problem is taken as a single 
region then HPR should be equivalent to HIPRO. Though the running time on the real instances 
can be somewhat different. 

S-PRD. Our reimplementation of Delong and Boykov [11] for an arbitrary graph and a fixed 
partition, using HPR as a core solver. It uses the same memory model, paging mechanism and 
the splitter tool as S-ARD. The region discharge is always warm-started. We found it inefficient 
to run the region-relabel after every discharge. In the current experiments, we run it once at 
the beginning and then only when a global gap is discovered. To detect a global gap, we keep 
a histogram of all labels, 0(n) memory, and update it after each region discharge (in 0(|V K |) 
time). In practice, this 0(n) memory is not a serious limitation - labels are usually well below 
n. If they are not then we should consider a weaker gap heuristic with a smaller number of bins. 
Applying the gap (raising the corresponding nodes to d°°) for all regions is delayed until they are 
loaded. So we keep the track of the best global gap detected for every region. Similar to how the 
sequential algorithm 1 represents both S-ARD and P-ARD, it constitutes a piece of generic code 
in our implementation, where the respective discharge procedure and gap heuristics are plugged. 

P-PRD. Implementation of parallel PRD for shared- memory system with OpenMP. 

6 Efficient Implementation of ARD 

The basic implementation of S-ARD, as described in the previous section, worked reasonably 
fast (comparable to BK) on simple problems like 2D stereo and 2D random segmentation (Sec. 7.1 ). 
However, on some 3D problems the performance was unexpectedly bad. For example, to solve 
LB07-bunny-lrg instance (Sec. 7.2) the basic implementation required 32 minutes of CPU time. 
In this section we describe an efficient implementation which is more robust and is comparable in 
speed with BK on all tested instances. In particular, to solve LB07-bunny-lrg it takes only 15 
seconds of CPU time. The problem why the basic implementation is so slow is in the nature of the 
algorithm: sometimes it has to augment the flow to the boundary, without knowing of whether 
it is a useful work or not. If the particular boundary was selected wrongly the work is wasted. 
This happens in LB07-bunny-lrg instance, where the data seeds are sparse. The huge work is 
performed on pushing the flow around in the first few iterations, before a reasonable labeling is 
established. We introduce two heuristics how to overcome this problem: the boundary-relabel 
heuristic and partial discharges. An additional speed-up is obtained by dynamically maintaining 
boundary search trees and the current labeling. 

6.1 Boundary Relabel Heuristic 

We would like to have a better distance estimate, but we cannot run a global relabel because 
implementing it in a distributed fashion would take several full sweeps, which would be too 
wasteful. Instead, we go for the following cheaper lower bound. Our implementation keeps all the 
boundary edges (including their flow and distance labels of the adjacent vertices) in the shared 
memory. Fig. 4(a) illustrates this boundary information. We want to improve the labels by 
analyzing only this boundary part of the graph, not looking inside the regions. Since we don't 
know how the vertices are connected inside the regions, we have to assume that every boundary 
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vertex might be connected to any other within the region, except of the following case. If u and v 
are in the same region R and d(u) > d(v) then we know for sure that u v in R. It follows from 
validity of labeling d. We now can calculate a lower bound on the distance d* B in G assuming 
that all the rest of the nodes are potentially connected within the regions. 




(a) (b) (c) 

Figure 4: Boundary relabel heuristic: (a) Boundary vertices of the network and a valid labeling. 
Directed arcs correspond to non-zero residual capacities. Vertices without numbers have label 
d°° and do not participate in the construction, (b) Vertices having the same label are grouped 
together within each region and arcs of zero length (of red color) are added from a group to the 
next label's group. It is guaranteed that e.g. vertices with label 1 are not reachable from vertices 
with label 2 within the region, hence there is no arc 2— >1. Black arcs have the unit length, (c) 
The distance in the auxiliary graph is a valid labeling and a lower bound on the distance in the 
original network. 

If d(v) = d(u) we have to assume that v — > u and u — > v in R, therefore the new lower bound 
for u and v will coincide. Hence we group vertices having the same label within a region together 
as shown in Fig. 4(b). In the case d(v) < d(u) we know that u -/> v but have to assume v — > u in 
R. We thus add a directed arc of length zero from the group of v to the group of u (Fig. 4(b)). 
Let d\ < g?2 < d 3 be labels of groups within one region. There is no need to create an arc d±— >d3, 
because two arcs d%-^d2 and o^— ^3 of length zero are an equivalent representation. Therefore it is 
sufficient to connect only groups having consequent labels. Let us denote thus constructed graph 
G. We can calculate the distance to nodes with label in G by running Dijkstra's algorithm in 
G. Let this distance be denoted d'. We then update the labels as 

d{u) :— max{d(u), d'(u)}. (12) 

We have to prove the following two points: 

1. d' is a valid labeling; 

2. If d and d' are valid labellings, then max(d, d') is a valid. 

Proof. 1. Let c(u, v) > 0. Let u and v be in the same region. It must be that d(u) < d(v). 
Therefore either u and v are in the same group or there is an arc of length zero from group 
of u to group of v. In any case it must be d'(u) < d'{v). If u and v are in different regions, 
there is an arc of length 1 from group of u to group of v and therefore d'(u) < d!{y) + 1. 
2. Let l(u,v) = 1 if (u,v) G (B,B) and l(u,v) = otherwise. We have to prove that if 
c(u, v ) > then 

max{d(u), d'{u)} < max{<i(t>), d'{y )} + l(u, v). (13) 
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Let max{d(u), d'{u)} = d{u). From validity of d we have d{u) < d(v)+l(u,v). If d{v ) > d'(v), 
then ma.x{d(v ), d'{y)} = d{y) and (13) holds. If d{y) < d'(v), then d{u) < d{v) + l{u,v) < 
d'{y) + l(u,v) and (13) holds again. 

The complexity of this algorithm is 0(\(B,B)\). It is relatively inexpensive and can be run after 
each sweep. 

6.2 Partial Discharges 

Another heuristic which proved very efficient was simply to postpone path augmentations to 
higher boundary vertices to further sweeps. In combination with boundary-relabel this allows to 
save a lot of unnecessary work. More precisely, on sweep s the algorithm ARD is allowed to execute 
only stages up to s. This way, in sweep only paths to the sink are augmented and not any path 
to the boundary. Nodes which cannot reach the sink (but can potentially reach the boundary) 
get label 1. These initial labels may already be improved by boundary-relabel. In sweep 1 paths 
to the boundary with label are allowed to be augmented and so on. 

6.3 Boundary Search Trees 



(a) (b) (c) 




Figure 5: Search trees, (a) A region with some residual arcs. The region has only 3 boundary 
nodes, for simplicity, (b) Search trees of the sink and boundary nodes: vertex v is in the tree with 
the lowest possible label of the root. The sink is assigned a special label —1. The source search 
tree is empty in this example, (c) Labels of the inner vertices are determined as the tree's root 
label+1. 

We now redesign the algorithm so that not only the sink and source search trees are maintained 
but also the search trees of boundary nodes. This allows to save computation when the labeling of 
many boundary nodes remains constant during the consequent sweeps, but only a small fraction 
is changed. Additionally, knowing the search tree for each inner vertex of the region determines 
its actual label, so the region-relabel procedure becomes obsolete. The design of the search tree 
data structures, their updates and other detail are the same as in [22], only few changes to the 
implementation are necessary. For each vertex v G R we introduce a mark d(v) which corresponds 
to the root label of its tree or is set to a special free mark if v is not in any tree. For each tree 
we keep a list of open vertices (called active in [22])- A vertex is open if it is not blocked by the 
vertices of the trees with the same or lower root label. The trees may grow only at the open 
vertices. 

Figure 5 shows the correspondence between search trees and the labels. The sink search tree 
is assigned label —1. In the stage of ARD we grow the sink tree and augment all found paths 
if it touches the source search tree. Vertices, which are added to the sink tree are marked with 
label d — — 1. In stage k + 1 of ARD we grow trees with root at a boundary vertices w with label 
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d(w) = k, all vertices added to the tree are marked with d = k. When the tree touches the source 
search tree, the found path is augmented. If the tree touches a vertex u with label d{u) < k, it 
means that u is already in the search tree with a lower root and no action is taken. It cannot 
happen that during growth of a search tree with root label k a vertex is reached with label d > k, 
this would contradict to properties of ARD. The actual label of a vertex v at any time is determined 
as d(v) + 1 if v e R and d(v) if v e B R . 

Let us now consider the situation when region R has build some search trees and the label of a 
boundary vertex w is risen from d(w) to d'(w) (as a result of update from the neighboring region 
or one of the heuristics). All the vertices in the search tree starting from w were previously marked 
with d(w) and have to be declared as free vertices or adopted to any other valid tree with root 
label d(w). The adaptation is performed by the same mechanism as in BK. The situation when a 
preflow is injected from the neighboring region and (a part of) a search tree becomes disconnected 
is also handled by the orphan adaptation mechanism. 

The combination of the above improvements allows S-ARD to run in about the same time as 
BK on all tested vision instance (Sect. 7.2), sometimes being even significantly faster (154 s. vs. 
245 s. on BL06-gargoyle-lrg). 

7 Experiments 

All experiments are conducted on a system with Intel Core 2 Quad CPU@2.66Hz, 4GB memory, 
Windows XP 32bit and Microsoft VC compiler. There is 3 series of experiments: 

• Synthetic experiments, where we observe general dependencies of the algorithms, with some 
statistical significance, i.e. not being biased to a particular problem instance. It also serves as 
an empirical validation, as thousands of instances are solved. Here, the basic implementation 
of S-ARD was used. 

• Sequential competition. We study sequential versions of the algorithms, running them on 
real vision instances. Only a single core of the CPU is utilized. We fix the region partition 
and study how much disk I/O it would take to solve each problem when only one region 
can be loaded in the memory at a time. In this and the next experiment we used the 
efficient implementation of ARD. Note, in [33] we reported worse results with the earlier 
implementation. 

• Parallel competition. Parallel algorithms are tested on the instances which can fully fit in 
2GB of memory. All 4 cores of the CPU are allowed to be used. We compare our algorithms 
with two other state-of-the-art distributed implementations. 

7.1 General Dependences: Synthetic Problems 

We generated simple synthetic problems to validate the algorithms. The network is constructed 
as a 2D grid with a regular connectivity structure. Figure 6(a) shows an example of such a network. 
The edges are added to the nodes at the following relative displacements (0, 1), (1,0), (1,2), (2, 1), 
(1,3), (3,1), (2,3), (3,2), (0,2), (2,0), (2,2), (3,3), (3,4), (4,2). By connectivity we mean the number 
of edges incident to a node far enough from the boundary. Adding pairs (0,1), (1,0) results in 
connectivity 4 and so on. Each node is given an integer excess/deficit distributed uniformly in the 
interval [—500 500]. A positive number means a source link and a negative number a sink link. All 
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Figure 6: (a) Example of a synthetic problem: a network of the size 6x6, connectivity 8, par- 
titioned into 4 regions. The source and sink are not shown, (b) Dependence on the interaction 
strength, for size 1000x1000, connectivity 8 and 4 regions. Plots show mean values and intervals 
containing 70% of the samples. 

edges in the graph are assigned a constant capacity, called strength. The network is partitioned 
into regions by slicing it in s equal parts in both dimensions. Thus we have 4 parameters: the 
number of nodes, the connectivity, the strength and the number of regions. We generate 100 
instances for each value of the parameters. 

Let us first look at the dependence on the strength shown in Figure 6(b). Problems with 
small strength are easy, because they are very local - long augmentation paths do not occur. For 
problems with large strength long paths needs to be augmented. However, finding them is easy 
because bottlenecks are unlikely. Therefore BK and S-ARD have a maximum in the computation 
time somewhere in the middle. It is more difficult to transfer the flow over long distances for 
push-relabel algorithms. This is where the global relabel heuristic becomes efficient and HIPR0.5 
outperforms HIPR0. The region-relabel heuristic of S-PRD allows it to outperform other push- 
relabel variants. 

In general, we think all such random 2D networks are too easy. Nevertheless, they are useful 
and instructive to show basic dependences. We now select the "difficult" point for BK with the 
strength 150 and study other dependencies: 

• The number of regions (Figure 7). For this problem family both the number of sweeps and 
the computation time grows slowly with the number of regions. 

• The problem size (Figure 8). Computation efforts of all algorithms grow proportionally. 
However, the number of sweeps shows different asymptotes. It is almost constant for S- 
ARD but grows significantly for S-PRD. 
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Figure 7: Dependence on the number of regions, for size 1000x1000, connectivity 8, strength 150. 
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Figure 8: Dependence on the problem size, for connectivity 8, strength 150, 4 regions. 



• Connectivity (Figure 9). Connectivity is not independent of the strength. Roughly, 4 edges 
with capacity 100 can transmit as much flow as 8 edges with capacity 50. Therefore while 
increasing the connectivity we also decrease the strength as (150 • 8) /connectivity in this 
plot. 

• Workload (Figure 7.1). This plot shows how much time each of the algorithms spends 
performing different parts of computation. Note that the problems are solved on a single 
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Figure 9: Dependence on the connectivity, for size 1000x1000, strength = (150 • 8) /connectivity, 
4 regions. 

computer with all regions kept in memory, therefor the time on sending messages should 
be understood as updates of dynamic data structure of the region w.r.t. the new labeling 
and flow on the boundary. For S-PRD more sweeps are needed, so the total time spent in 
messages and gap heuristic is increased. Additionally, the gap heuristic has to take into 
account all nodes, unlike only the boundary nodes in S-ARD. 
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Figure 10: Workload distribution, for size 1000x1000, connectivity 8, 4 regions, strength 150. 
msg - passing the messages (updating flow and labels on the boundary), discharge - work done 
by the core solver (BK for S-ARD and HPR for S-PRD), relabel - the region- relabel operation, 
gap - the global gap heuristic. 



7.2 Sequential Competition 

We tested our algorithms on the maxflow problem instances published by the Computer 
Vision Research Group at the University of Western Ontario 4 . The data consist of typical max- 
flow problems in computer vision, graphics, and biomedical image analysis. Stereo instances 
are sequences of subproblems (arising in the expansion move algorithm) for which the total time 

4 http : //vision. csd.uwo . ca/maxf low-data/ 
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should be reported. There are two models: BVZ [8j, in which the graph is a 4-connected 2D 
grid, and KZ2 [21], in which there are additional long-range links. Multiview 3D reconstruction 
models LB06 [2H] arid BL06 [TJ. Graphs of these problems are cellular complexes subdividing 
the space into 3D cubes and each cube into 24 smaller cells. Surface fitting instances LB07 [27J 
are 6-connected 3D grid graphs. And finally, there is a collection of volumetric segmentation 
instances BJ01 [5j, BF06 [1], BK03 [3] with 6-connected and 26-connected 3D grid graphs. 

To test our streaming algorithms, we used the regulargrid hint available in the definition of 
the problems to select the regions by slicing the problem into 4 parts in each dimension - into 16 
regions for 2D BVZ grids and into 64 regions for 3D segmentation instances. Problems KZ2 do 
not have such a hint (they are not regular grids), so we sliced them into 16 pieces just by the node 
number. The same we did for the multiview LB06 instances. Though they have a size hint, we 
failed to interpret the node layout correctly (the separator set, B, was unexpectedly large when 
trying to slice along the dimensions). So we sliced them purely by the node number. 

One of the problems we faced is pairing the arcs which are reverse of each other. While in 
stereo, surface and multiview problems, the reverse arcs are consequent in the files, and can be 
easily paired, in 3D segmentation they are not. For a generic algorithm, not being aware of the 
problem's regularity structure, it is actually a non-trivial problem requiring at least the memory 
to read all of the arcs first. Because our goal is a relative comparison, we did not pair the arcs 
in 3D segmentation. This means we kept twice as many arcs than necessary for those problems. 
In Table 1 this is seen, e.g. for babyf ace .n26cl00, which is 26-connected, but we construct a 
multigraph (has parallel arcs) with average node degree of 49. For some other instances, however, 
this is not visible, because there could be many zero arcs, e.g. liver. n26cl0 which is a 26- 
connected grid too, but has the average node degree of 10.4 with unpaired arcs. The comparison 
among different methods is correct, since all of them are given exactly the same multigraph. 

The results are presented in Table 1. We did measure the time of disk I/O, however it depends 
on the hard drive performance, other concurrently running processes as well as on system file 
caching (has effect for small problems) and therefore we report only bytes written/loaded. Note 
that disk I/O is not proportional to the number of sweeps, because some regions may be inactive 
during a sweep and thus skipped. For HIPR we do not monitor the memory usage. It is slightly 
higher than that of HPR, because of keeping initial arc capacities. 

For verification of solvers, we compared the flow values to the ground truth solution provided 
in the dataset. Additionally, we saved the cut output from each solver and checked its cost 
independently. Verifying the cost of the cut is relatively easy: the cut can be kept in memory and 
the edges can be processed form the DIMACS problem definition file on-line. An independent 
check of (pre-)flow feasibility would be necessary for full verification of a solver. However, it 
requires storing the full graph in memory and was not implemented. 

Table 1. Sequential Competition. CPU - the time spent purely for computation, excluding the time 
for parsing, construction and disk I/O. The total time to solve the problem is not shown. K - number 
of regions. RAM - memory taken by the solver; for BK in the case it exceeds 2GB limit, the expected 
required memory; for streaming solvers the sum of shared and region memory. I/O - total bytes read or 
written to the disk. 
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0.9GB 


22+16MB 




19GB 


37+17MB 




261GB 


babyface.n6cl00 5.1 


11.5 


13s 


71s 


65s 


87s 


24s 


19 


64 


74s 


191 


64 


250x250x81, 0.6GB 




1.0GB 






0.9GB 


22+16MB 




18GB 


37+17MB 




189GB 


adhead.n26cl0 12.6 


31.5 










128s 


17 


64 


224s 


109 


64 


256x256x192, 6.5GB 




6.2GB 








153+83MB 




84GB 


195+86MB 




0.8TB 


adhead.n26cl00 12.6 


31.6 










174s 


21 


64 


269s 


129 


64 


256x256x192, 6.7GB 




6.3GB 








153+83MB 




90GB 


196+86MB 




0.8TB 


adhead.n6cl0 12.6 


11.6 










36s 


13 


64 


119s 


161 


64 


256x256x192, 1.6GB 




2.5GB 








34+36MB 




31GB 


77+39MB 




372GB 


adhead.n6cl00 12.6 


11.7 










49s 


20 


64 


121s 


165 


64 


256x256x192, 1.6GB 




2.5GB 








34+36MB 




36GB 


77+39MB 




354GB 


bonc.n26cl0 7.8 


32.3 










25s 


12 


64 


96s 


148 


64 


256x256x119, 3.9GB 




4.0GB 








122+61MB 




35GB 


147+63MB 




470GB 


bone.n26cl00 7.8 


32.4 










29s 


14 


64 


68s 


124 


64 


256x256x119, 4.1GB 




4.0GB 








122+61MB 




39GB 


147+63MB 




321GB 


bone.n6cl0 7.8 


11.5 


7.7s 


5.7s 


17s 


12s 


7.2s 


9 


64 


37s 


195 


64 


256x256x119, 0.9GB 




1.5GB 






1.4GB 


62+23MB 




13GB 


52+25MB 




188GB 


Continued on next page 
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Table 1 — continued from previous page 



bonc.n6cl00 7.8 11.6 
256x 256x 1 1 9 1 OCR 


9.1s 
1.6GB 


9.1s 


22s 


14s 
1.5GB 


8.7s 10 64 
62-I-23MR 1 3GR 


23s 65 64 
e;2+25MR 104GR 


bone_subx.n6cl00 3.9 11.8 
128x256x119, 495MB 


7.1s 
0.8GB 


6.3s 


12s 


6.4s 
0.7GB 


5.5s 12 64 
39+12MB 7.1GB 


9.4s 42 64 
29+13MB 42GB 


bone_subxy.n26cl00 1.9 32.2 
128x128x119, 1.0GB 


5.9s 
1.0GB 


3.9s 


6.1s 


4.6s 
0.8GB 


7.3s 13 64 
92+851MB O.OMB 


8.7s 33 64 
50+16MB 39GB 


abdomcn_long.n6cl0 144.4 11.8 
512x512x551, 19GB 


29GB 








179s 11 
410+403MB 196GB 


> 35 

>1TB 


abdomen_short.n6cl0 144.4 11.8 
512x512x551, 19GB 


29GB 








82s 11 
410+403MB 138GB 





Our new algorithms computed flow values for all problems matching those provided in the 
dataset, except for the following cases: 

• LB07-bunny-lrg: no ground truth solution available (we found flow/cut of cost 15537565). 

• babyf acen26cl0 and babyf acen26cl00: we found higher flow values than those which were 
provided in the dataset (we found flow/cut of cost 180946 and 1990729 resp.). 

The latter problems appear to be the most difficult for S-ARD in terms of both time and number 
of sweeps. Despite this, S-ARD requires much fewer sweeps, and consequently much less disk I/O 
operations than the push-relabel variant. This means that in the streaming mode, where read and 
write operations take a lot of time, S-ARD is clearly superior. Additionally, we observe that the 
time it spends for computation is comparable to that of BK, sometimes even significantly smaller. 

Next, we studied the dependency of computation time and number of sweeps on the number 
of regions in the partition. We selected 3 representative instances of different problems and 
solved them using partitions into different number of regions. The results are presented in the 
Fig. 11, The instance BL06-gargoyle-sml was partitioned by the node number and the rest two 
problems were partitioned in 3D according to their grid sizes using variable number of slices in 
each dimension. These results shows that the computation time required is stable over a large 
range of partitions and the number of sweeps does not grow rapidly. Therefore, the partition for S- 
ARD can be selected to meet other requirements: memory consumption, number of computation 
units, etc. We should note however, that with refining the partition the amount of shared memory 
grows proportionally to the number of boundary edges. In the limit of single-vertex regions the 
algorithm will turn into a very inefficient implementation of pure push-relabel. 

7.3 Parallel Competition 

In this section we test parallel versions of our algorithms and compare them with two state- 
of-the-art methods. The experiments are conducted on the same machine as above (Intel Core 2 
Quad CPU@2.66Hz) but allowing the use of all 4 CPUs. The goal is to see how the distributed 
algorithms perform in the simplified setting when they are run not in the network but on a single 
machine. For P-ARD/PRD we expect that the total required work would increase compared to 
the sequential versions because the discharges are executed concurrently. The relative speed-up 
therefore would be sublinear even if we managed to distribute the work between CPUs evenly. 
The tests are conducted on small and medium size problems (taking under 2GB of memory). 
For P-ARD and P-PRD we use the same partition into regions as in Table 1, For other solvers, 
discussed next, we tried to meet better their requirements. 
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Figure 11: Dependence on the number of regions for the representative instances of multiview, 
stereo and segmentation. Top: CPU time used. Bottom: number of sweeps. 

DD. The dual decomposition method 5 [34] . The algorithm for the integer maxflow/mincut 
problem presented in [31] is a heuristic which has no guarantees. In particular, it is not guaranteed 
to terminate. The actual implementation uses additional randomization helping to "guess" the 
last bit of the solution. With this randomization switched off the algorithm did not terminate in 
1000 iterations on a simple example of 4 nodes. The algorithm does not scale well with the number 
of regions in the partition, it performs better with fewer regions. We tested it with partitions 
into 2 regions and 4 regions (denoted DDx2 and DDx4 resp.). Naturally, with 2 regions it can 
utilize only 2 CPUs. To our surprise, the algorithm terminated on all of the stereo problems in 
a small number of iterations. However, on larger problems partitioned into 4 regions it exceeded 
its internal bound of iterations (1000) in many cases and returned without optima flow/cut. In 
such a case it provides only an approximate solution to the problem. Whether such a solution is 
of practical value is beyond us. 

RPR. A recently published implementation of Region Push Relabel [TTJ by Sameh Khamis 
(vl.01, http://vision.csd.uwo.ca/code/). For RPR we constructed partition of the problem 
into smaller "blocks". Because regions in RPR are composed dynamically out of blocks (default 
is 8 blocks per region) we partitioned 2D problems into 64 = 8 2 blocks and 3D problems into 
512 = 8 3 blocks. This partitioning was also empirically faster than a coarser one. The parameter 
DischargesPerBlock was set by recommendation of authors to 500 for small problems (stereo) 
and to 15000 for big problems. The implementation is specialized for regular grids, therefore 



5 Multi-threaded maxflow library http : //www . maths . 1th . se/matematiklth/personal/petter/cppmaxf low . 
php 
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multiview and KZ2 problems which do not have regulargrid hint cannot be solved by this 
method. Because of the fixed graph layout in RPR, arcs which are reverse of each other are 
automatically grouped together, so RPR computes on a reduced graph compared to other methods. 
Let us also note that because of the dynamic regions, RPR is not fully suitable to run in a 
distributed system. 

The method [30] (parallel, but not distributed) would probably be the fastest one in this 
competition (as could be estimated from the results reported in [30]), however the implementation 
is not publicly available. 



Table 2: Parallel Competition 



problem 


BK [6] 


DDx2 


34 


DDx4 


EH 


P-ARD 


P-PRD 


RPR 






time 


time sweeps 


stereo 


BVZ-sawtooth(20) 


0.68s 


0.52s 


7 


0.37s 


ii 


0.30s 


7 


2.4s 


31 


4.8s 


274 


BVZ-tsukuba(16) 


0.36s 


0.28s 


6 


0.20s 8 


0.17s 


5 


1.5s 


33 


2.1s 


197 


BVZ-venus(22) 


1.2s 


0.84s 


7 


0.59s 


9 


0.50s 


7 


4.9s 


36 


8.0s 


466 


KZ2-sawtooth(20) 


1.8s 


1.2s 


11 


0.91s 


16 


0.96s 6 


4.9s 


23 




KZ2-tsukuba(16) 


1.1s 


0.67s 


7 


0.52s 


11 


0.70s 8 


4.9s 


22 




KZ2-venus(22) 


2.8s 


1.9s 


7 


1.3s 


12 


1.5s 


10 


10s 


39 




multiview 


BL06-camel-med 


25s 


18s 


221 


13s 


260 


8.7s 


14 


81s 


322 




BL06-camcl-sml 


0.98s 


0.63s 


11 


0.49s 


27 


0.49s 


10 


2.5s 


70 




BL06-gargoyle-lrg 


245s 


120s 


517 


mem 


58s 


23 


mem 




BL06-gargoyle-med 


115s 


59s 


20 


38s 


50 


27s 


21 


79s 


219 




BL06-gargoyle-sml 


6.1s 


3.0s 


19 


1.9s 


19 


1.6s 


10 


2.4s 


52 




surface 


LB07-bunny-mcd 


1.6s 


1.3s 


11 


1.1s 


11 


1.3s 


13 


12s 


35 


37s 


349 


LB07-bunny-sml 


0.17s 


0.12s 


11 


0.12s 


11 


0.21s 8 


0.58s 


21 


3.5s 


99 


segm 


liver. n6cl0 


7.2s 


X 7.6s 


1000 


X 22s 


1000 


8.9s 


23 


23s 


164 


5.1s 


1298 


liver. n6cl00 


15s 


17s 


31 


X 21s 


1000 


12s 


17 


23s 


102 


7.3s 


1722 


babyface.n6cl0 


6.8s 


8.8s 


61 


X 24s 


1000 


12s 


22 


61s 


135 


17s 


4399 


babyface.n6cl00 


13s 


16s 


338 


X 20s 


1000 


17s 


23 


61s 


179 


22s 


4833 


bone.n6cl0 


7.7s 


5.2s 


22 


X 8.2s 


1000 


4.9s 


17 


16s 


182 


6.3s 


918 


bone.n6cl00 


9.1s 


5.3s 


12 


4.1s 


17 


6.2s 


13 


14s 


70 


7.9s 


1070 


bone_subx.n6cl00 


7.1s 


6.3s 


24 


5.2s 


34 


3.9s 


17 


5.8s 48 


1.5s 


747 


bonc_subxy.n26c 1 00 


5.9s 


3.4s 


11 


3.2s 


12 


5.8s 


16 


6.0s 


37 


hang 



The results are summarized in table 2, The time reported is the wall clock time passed in the 
calculation phase, not including any time for graph construction. The number of sweeps for DD 
has the same meaning as for P-ARD/PRD, it is the number of times all regions are synchronously 
processed. RPR however is asynchronous and uses dynamic regions. For it we define sweeps = 
block_discharges/number_of _blocks. 

Comparing to Table 1, we see that P-ARD on 4 CPUs is about 1.5 — 2.5 times faster than 
S-ARD. The speed-up over BK varies from 0.8 on Iivern6cl0 to more than 4 on gargoyle. 

We see that DD gets lucky some times and solves the problem really quickly, but often it fails 
to terminate. We also observe that our variant of P-PRD (based on highest first selections rule) is 
a relatively slow, but robust distributed method. RPR, which is based on FIFO selection rule, is 
competitive on the 3D segmentation problems but is slow on other problems, despite its compile- 
time optimization for the particular graph structure. It is also uses relatively higher number of 
blocks, The version we tested always returned the correct flow value but often a wrong (non- 
optimal) cut. Additionally, for 26 connected bone_subxy .n26cl00 it failed to terminated within 
1 hour. 
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8 Region Reduction 



Some vertices become disconnected from the sink in the course of the studied algorithms. If 
they are still reachable from the source, they must belong to the source set of any optimal cut. 
Such vertices do not participate in further computations and the problem can be reduced by 
excluding them. Unfortunately, the opposite case, when a vertex must be strictly in the sink set 
is not discovered until the very end of the algorithm. 

The following algorithm attempts to identify as many nodes as possible for a given region. It 
is based on the following simple consideration: if a node is disconnected from the sink in G R as 
well as from the region boundary, B R , then it is disconnected from the sink in G; if a node is not 
reachable from the source in G R as well as from B R then it is not reachable from the source in G. 

Let us say that a node v is a strong source node (resp. a strong sink node) if for any optimal 
cut (C,C) v G C (resp. v G C). Similarly, v will be called a weak source node (resp. weak sink 
node), if there exist an optimal cut (C, C) such that v G C (resp. v G C). 

Kovtun [25J suggested to solve two auxiliary problems, modifying network G R by adding infinite 
capacity links from the boundary nodes to the sink and in the second problem adding infinite 
capacity links from the source to the boundary nodes. In the first case, if v is a strong source 
node in the modified network G R , it is also a strong source node in G. Similarly, the second 
auxiliary problem allows to identify strong sink nodes in G. It requires solving a maxflow problem 
on G R twice. We improve this construction by reformulating it as the following algorithm finding 
a single flow in G R . 



Algorithm 5: Region Reduction (G R , B 1 



/ * Input : network G R , boundary B 

1 Augment (s, t); 

2 B s := {v | v G B R , s — > v}; 

3 B T := {v | v G B R ,v t}; 

4 Augment (s, B s ); 

5 Augment (B T , t); 

6 foreach v G R do 



10 

li 



if s — > v then v is strong source node; 
if v — > t then v is strong sink node; 
otherwise 

if v B then v is weak source node; 
if B v then v is weak sink node 



*/ 

/* source boundary set */ 
/* sink boundary set */ 



Statement 11. Sets B s and B T constructed in step 2 are disjoint. 

Proof. We have s t after step 1, hence there cannot exist simultaneously a path from s to v 
and a path from v to t. □ 

After step 1 the network G R is split into two disconnected networks: with nodes reachable from 
s and nodes from which t is reachable. Therefore any augmentations occurring in steps 4 and 
5 act on their respective subnetworks and can be carried independently of each other. On the 
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output of Alg. 5 we have: 
in Fig. 12, 



s 



B U {t} and B U {s} t. The classification of nodes is shown 





(a) (b) 

Figure 12: Classification of nodes in V R build by Alg. 5, Nodes reachable from s are strong source 
nodes. Nodes from which t is reachable are strong sink nodes. The remaining nodes can be 
classified as weak source (a) if they cannot reach boundary, or as weak sink (b) if they are not 
reachable from the boundary. Some nodes are both: weak source and weak sink, this means they 
can be on both sides of an optimal cut (but not independently). 

Augmenting on (s, t) in step 1 and on (s, B s ) in the step 4 is the same work as done in ARD 
(where (s, B ) paths are augmented in the order of labels of B s ). This is not a coincidence, these 
algorithms are very much related. However, the augmentation on (B T ,t) in step 5 cannot be 
executed during ARD. It would destroy validity of the labeling. It may only be executed during 
the first sweep of S-ARD, as an initialization. Otherwise, we may consider Algorithm 5 as a 
general preprocessing. 

If v is a weak source node, it follows that it is not a strong sink node. In the preflow pushing 
algorithms we find the cut (T,T), where T is the set of all strong sink nodes in G. We consider 
that v is decided if it is a strong sink or a weak source node. 



BVZ-sawtooth(20) 80.0% 

BVZ-tsukuba(16) 72.8% 

BVZ-vcnus(22) 70.2% 

KZ2-sawtooth(20) 85.0% 

KZ2-tsukuba(16) 69.9% 

KZ2-venus(22) 75.8% 

BL06-camel-lrg 2.0% 

BL06-camel-med 2.3% 

BL06-camel-sml 4.6% 

BL06-gargoyle-lrg 6.0% 

BL06-gargoyle-med 2.4% 

BL06-gargoyle-sml 9.8% 

LB07-bunny-lrg 11.4% 

LB07-bunny-med 13.1% 



LB07-bunny-sml 

liver. n26cl0 

liver. n26cl00 

liver. n6cl0 

liver. n6cl00 

babyface.n26cl0 

babyface.n26cl00 

babyface.n6cl0 

babyface.n6cl00 

adhcad.n26cl0 

adhead.n26cl00 

adhead.n6cl0 

adhead.n6cl00 

bonc.n26cl0 



15.6% bone.ri26cl00 6.9% 

7.1% bone.n6cl0 8.8% 

5.3% bone.n6cl00 7.0% 

7.2% bone_subx.n26cl0 6.6% 

5.3% bone_subx.n26cl00 6.6% 

29.3% bone_subx.n6cl0 6.3% 

30.9% bone_subx.n6cl00 6.3% 

35.4% bone_subxy.n26cl0 6.6% 

33.7% bone_subxy.n26cl00 6.6% 

0.3% bone_subxy.n6cl0 6.4% 

0.3% bone_subxy.n6cl00 6.3% 

0.2% bone_subxyz.n26cl0 6.6% 

0.1% bone_subxyz.n26cl00 6.6% 

8.7% bone_subxyz.n6cl0 6.6% 



bone_subxyz.n6cl00 6.6% 

bone_subxyz_subx.n26cl0 7.9% 

bone_subxyz_subx.n26cl00 6.6% 

bone_subxyz_subx.n6cl0 8.2% 

bone_subxyz_subx.n6cl00 6.6% 

bone_subxyz_subxy.n26cl0 11.3% 

bone_subxyz_subxy.n26cl00 9.5% 

bone_subxyz_subxy.n6cl0 12.7% 

bone_subxyz_subxy.n6cl00 9.3% 

abdomcnJong.n6cl0 1.7% 

abdomen_short.n6cl0 6.3% 



Table 3: Percentage of nodes which can be decided by preprocessing. The problems are partitioned 
into regions the same way as in Table 1 , For stereo problems the average number over subproblems 
is shown. 

Table 3 gives the percentage of how many vertices are decided (and hence can be excluded from 
the problem) by Algorithm 5 for computer vision problems. It is seen that in stereo problems, 
a large percent of vertices is decided. These problems are rather local and potentially can be 
fully solved by using Algorithm 5 several times in overlapping windows. In contrast, only a small 
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fraction can be decided locally for the other problems. 

9 Appendix A: Tightness of 0(n 2 ) bound for PRD. 

Here we give an example of a network, its partition into regions and a sequence of valid push 
and relabel operations, implementing PRD, such that sequential region discharge requires 0(n 2 ) 
sweeps. Because there is only one active node at any time, it also applies to parallel PRD. 

We start by an auxiliary example, in which the preflow is transfered from a vertex to a boundary 
vertex with a higher label. In this example some inner vertices of a region are relabeled, but not 
the boundary vertices. Therefore the total number of sweeps cannot be bounded by the number 
of relabellings of the boundary vertices only. 

Example 1. Consider a network of 6 regular nodes in Figure 13. Assume all edges have infinite 
capacity, so only non-saturating pushes occur. There are two regions -Ri = {1,2,3,4,5} and 
i?2 = {6}. Figure 13 shows a sequence of valid push and relabel operations. We see that some 
nodes were raised due to relabel, but the net effect is that flow excess from node 1 was transfered 
to node 6, which has a higher label. Moreover none of the boundary nodes (nodes 1,5,6) were 
relabeled. 



Ri Ri 




(«) (&) (c) (d) (e) (/) 

Figure 13: Steps of Example 1. Node's height correspond to its label. Black box shows the node 
with excess. The source node and think node are not shown, (a) (b) flow excess is pushed to node 
2; (c) node 2 is relabeled, so that two pushes are available and excess is pushed to node 3; (d-f) 
similar. 

Example 2. Consider the network in Figure 14, The first step corresponds to a sequence of push 
and relabel operations (same as in Figure 13) applied to the chain (1, 2a, 3a, 4a, 5, 6). Each next 
step starts with the excess at node 1. Chains are selected in turn in the order a, 6, c. It can be 
verified from the figure that each step is a valid possible outcome of PRD applied first to R± and 
then to i?2- The last configuration repeats the first one with all labels raised by 2, so exactly the 
same loop may be repeated many times. 

It is seen that nodes 1,5,6 are relabeled only during pushes on paths a and b and never during 
pushes on path c. If there were more paths like path c, it would take many iterations (= number 
of region discharge operations) before boundary vertices are risen. Let there be k paths in the 
graph (d, e. . . ), handled exactly the same way as path c. The number of nodes in the graph is 
n = O(k). It will take 0(n) region discharge operations to perform each loop, raising all nodes 
by 2. Therefore until node 1 reaches the maximal label it will take 0(n 2 ) steps. 

Because there is only one active node at any time, this example is independent of the rule used 
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Figure 14: Steps of Example 2. Top left: a network with several chains of nodes like in Example 
1. Nodes 1,5,6 are common for all chains but there are separate copies of nodes 2,3,4 denoted 
by letters. In addition, there is a reverse arc from node 6 to node 1. From left to right, top to 
bottom: one step of transferring a flow from node 1 to node 6 using one of the chains and then 
pushing it through the arc (6,1), relabeling 6 when necessary. 



to select the active node (highest label selection rule or FIFO rule). Also, noting that the number 
of boundary nodes is 3, we see that our S-ARD algorithm will terminate in a constant number of 
sweeps for arbitrary k. 

10 Appendix B: Relation to Dual Decomposition 

In our approach we partition the set of vertices into regions and couple the regions by sending 
the flow through the inter-region edges. In the dual decomposition for mincut |33] detailed below, 
a separator set of the graph is selected, each subproblem gets a copy of the separator set and the 
coupling is achieved via the constraint that the cut of the separator set must be consistent across 
the copies. We now show how the dual variables in [31] can be interpreted as flow, thus relating 
approach [M] to ours. 
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(d) 

Figure 15: Interpretation of dual decomposition, (a) Example of a network with denoted capaci- 
ties. Terminal capacities are shown in circles, where "+" denotes s-link and "— " denotes t-link. 
M fl N is a separator set. (b) The network is decomposed into two networks holding copies of the 
separator set. The associated capacities are divided (not necessarily evenly) between two copies. 
The variable Ai is the Lagrangian multiplier of the constraint x v = y v . (c) Introducing edges of 
infinite capacity enforces the same constraint, that v' and v" are necessarily in the same cut set 
of any optimal cut. (d) A maximum flow in the network (c), the flow value on the red edges 
corresponds to the optimal value of dual variables A. 

Decomposition of the mincut problem into two parts is formulated in [31] as follows. Let 
M, N C V are such that M U N = V, {s, t} C M D N and there are no edges in E from M\N to 
N\M and vice-versa. Let x: M — > {0, 1} and y: N — > {0, 1} be the indicator variables of the cut 
set, where corresponds to the source set. Then mincut problem can be reformulated as: 

mm C M (x) + C N (y), 

{x s = x s = 0, ( 14 ) 
x t = y t = 1, 
xt = y u Vi e M n N, 

where 



C M (x)= ^(hm-^xt, (15a) 

C N (y)= '"'('•./!( 1 <l ■)<!■'■ (15b) 

c m {i,3) + c n {i,3) = c{i,j), (16a) 

c M (i,j) = Vi,j G N\M, (16b) 

c rf (i,j) = Vz,j'GM\AT, (16c) 
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E M = (M, M) = (MxM) fl E and E N = (N, N). The minimization over x and y decouples once 
the constraint Xi = yi is absent. The dual decomposition approach is to solve the dual problem: 



max 

A 



rain(c M (x) + A;(l - + nun (c N (y) - A*(l - y s )y t > j • (17) 

x a =0 ieMUN x a =0 ieMUN 

x t = l x t =l 



We observe that dual variables A correspond to flow on the artificial edges of infinite capacity 
between variable x« and y^ like it is explained in Fig. 15. For a problem with integer capacities 
there exist an integer optimal flow. This observation provides an alternative proof of [34, Theorem 
2], stating that there exist an integer optimal A. 

The algorithm we introduced could be applied to such a decomposition by running it on the 
extended graph (c./. Fig. 15(c)), where vertices of the separator set are duplicated and linked by 
additional edges of infinite capacity. It could be observed, however, that this construction does 
not allow to reduce the number of boundary vertices or the number of inter-region edges, while 
the size of the regions increases. Therefore it is not beneficial with our approach. 



11 Conclusion 



We developed a new algorithm for mincut problem on sparse graphs, which combines augment- 
ing paths and push-relabel approaches. The main result of this work is the worst case complexity 
guarantee of 0(|i3| 2 ) sweeps for the sequential and parallel variants of the algorithm (S/P-ARD). 
We also gave a novel parallel version of the region push-relabel algorithm of [11], an improved 
algorithm for local problem reduction (Sect. 8) and a number of auxiliary results. 

Both is theory and practice (randomized tests) S-ARD has a better asymptote in the number 
of sweeps than the push- relabel variant. Experiments on real instances showed that when run 
on a single CPU and the whole problem fits into memory, S-ARD is comparable in speed with 
the non-distributed BK implementation, being even significantly faster in some cases. When only 
one region is loaded into memory at a time, S-ARD used much fewer disk I/O than S-PRD. We 
also demonstrated that the running time and the number of sweeps are very stable with respect 
to the partition of the problem into up to 64 regions. In the parallel mode, using 4 CPUs, P- 
ARD achieves a relative speedup of about 1.5 — 2.5 times over S-ARD and uses just slightly 
larger number of sweeps. P-ARD compares favorably to other parallel algorithms, being a robust 
method suitable for a use in a distributed system. 

Our algorithms are implemented for generic graphs. Clearly, it is possible to specialize the 
implementation for grid graphs, which would reduce the memory consumption and might reduce 
the computation time as well. 

A practically useful mode could be actually a combination of parallel and sequential, when 
several regions are loaded into the memory at once and processed in parallel. There are several 
particularly interesting combinations of algorithm parallelization and hardware, which may be 
exploited: 1) parallel on several CPUs, 2) parallel on several network computers, 3) sequential, 
using Solid State Drive, 4) sequential, using GPU for solving region discharge. 

There is the following simple way how to allow region overlaps in our framework. Consider a 
sequential algorithm, which is allowed to keep 2 regions in memory at a time. It can then load 
pairs of regions (1, 2), (2, 3), (3, 4). . . , and alternate between the regions is a pair until both are 
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discharged. With PRD this is efficiently equivalent to discharging twice larger regions with a 1/2 
overlap and may significantly decrease the number of sweeps required. In the case of a 3D grid, it 
would take 8 times more regions to allow overlaps in all dimensions. However, to meet the same 
memory limit, the regions have to be 8 times smaller. It has to be verified experimentally whether 
it is beneficial. The RPR implementation of [H] uses exactly this strategy, a dynamic region is 
composed out of a number of smaller blocks and blocks are discharged until the whole region is 
not discharged. It is likely that with this approach we could further reduce the disk I/O in the 
case of the streaming solver. 
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